Yesterday, Prashanth Govindarajan posted an article about DataFrame on the .NET Blog. I got excited and wanted to try the library as soon as I could. In this post, I will explain what the library is intended for and what are my thoughts on it.
The package
The DataFrame related classes were introduced in the package Microsoft.Data.Analysis. The source code is available in the corefxlab repository, Microsoft.Data.Analysis.
This means that the package is experimental and is currently in preview.
Installing
As with all NuGet packages, you can install it either by using the Visual Studio NuGet Package Manager or using the CLI: dotnet add package Microsoft.Data.Analysis
.
Note: You can also install the package in a .NET Jupyter Notebook.
Functionalities
- Loading data from CSV (not mature enough).
- Easy column/row selection.
- Join/Merge DataFrames.
- GroupBy.
- Handle Null values.
- And others…
Example
Creating Data
In this example, we’ll be creating data instead of loading it from a CSV, since DataFrame.LoadCsv
doesn’t handle null values and all structs (e.g. DateTime
) yet.
string[] names = { "John", "Ahmed", "Chris", "Albert" };
int[] salaries = { 20000, 30000, 40000, 10000 };
DateTime[] birthdays = { DateTime.Parse("23/4/1990"), DateTime.Parse("4/5/1982"),
DateTime.Parse("2/1/1980"), DateTime.Parse("9/10/1994") };
string[] departments = { "Development", "Development", "HR", null };
var idColumn1 = new PrimitiveDataFrameColumn<int>("Id", new int[] { 1, 2, 3, 4 });
var nameColumn = new StringDataFrameColumn("Name", names);
var birthdayColumn = new PrimitiveDataFrameColumn<DateTime>("Birthday", birthdays);
var salaryColumn = new PrimitiveDataFrameColumn<int>("Salary", salaries);
var idColumn2 = new PrimitiveDataFrameColumn<int>("EmployeeId", new int[] { 3, 1, 2, 4 });
var departmentColumn = new StringDataFrameColumn("Department", departments);
var employeesDf = new DataFrame(idColumn1, nameColumn, birthdayColumn, salaryColumn);
var departmentsDf = new DataFrame(idColumn2, departmentColumn);
The example data is very simple, we have employees (id, name, birthday and salary) and which department they belong to.
Note: This code would change to 2 lines when LoadCsv
works properly.
DataFrame information and description
employeesDf.PrettyPrint();
Id | Name | Birthday | Salary |
---|---|---|---|
1 | John | 23/04/1990 | 20000 |
2 | Ahmed | 04/05/1982 | 30000 |
3 | Chris | 02/01/1980 | 40000 |
4 | Albert | 09/10/1994 | 10000 |
departmentsDf.PrettyPrint();
EmployeeId | Department |
---|---|
3 | Development |
1 | Development |
2 | HR |
4 | null |
employeesDf.Info().PrettyPrint();
Info | Id | Name | Birthday | Salary |
---|---|---|---|---|
DataType | System.Int32 | System.String | System.DateTime | System.Int32 |
Length (excluding null values) | 4 | 4 | 4 | 4 |
departmentsDf.Info().PrettyPrint();
Info | EmployeeId | Department |
---|---|---|
DataType | System.Int32 | System.String |
Length (excluding null values) | 4 | 3 |
I would’ve hoped to see more information in Info
, for example the memory usage like in pandas.
employeesDf.Description().PrettyPrint();
Description | Id | Birthday | Salary |
---|---|---|---|
Length (excluding null values) | 4 | 4 | 4 |
Max | 4 | null | 40000 |
Min | 1 | null | 10000 |
Mean | 2.5 | null | 25000 |
PrettyPrint
DataFrame.ToString
only takes into account the maximum length of columns when printing the dataframe, which doesn’t always work since some column values might be longer.
I created a function that does the opposite to get some good looking tables, the code is available at the end of the post.
Fill nulls with a value
departmentColumn.FillNulls("Other", inPlace: true);
In the future, it would be nice to have more ways to fill nulls, for example the most present value (in this case Development).
Creating a column from an existing one by applying a function
var currentYear = DateTime.Now.Year;
employeesDf["Age"] = birthdayColumn.Apply(d => currentYear - d.Year);
employeesDf.PrettyPrint();
For now, DataFrame.Apply
only handle same-type function, meaning that if your colum is of type DateTime
, the output needs to be the same, which doesn’t work for our example (and for other use cases).
I opened an issue in the repo (here) and I’m working on an PR, for now, I’m using a custom extension method that will be available at the end of the post too.
Id | Name | Birthday | Salary | Age |
---|---|---|---|---|
1 | John | 23/04/1990 00:00:00 | 20000 | 29 |
2 | Ahmed | 04/05/1982 00:00:00 | 30000 | 37 |
3 | Chris | 02/01/1980 00:00:00 | 40000 | 39 |
4 | Albert | 09/10/1994 00:00:00 | 10000 | 25 |
Normalize a column
Note: This example intends to show how the columns can handle operations.
var minSalary = (float)(int)salaryColumn.Min();
var maxSalary = (int)salaryColumn.Max();
employeesDf["NormalizedSalary"] = (salaryColumn - minSalary) / (maxSalary - minSalary);
employeesDf.PrettyPrint();
Id | Name | Birthday | Salary | Age | NormalizedSalary |
---|---|---|---|---|---|
1 | John | 23/04/1990 00:00:00 | 20000 | 29 | 0.3333333333333333 |
2 | Ahmed | 04/05/1982 00:00:00 | 30000 | 37 | 0.6666666666666666 |
3 | Chris | 02/01/1980 00:00:00 | 40000 | 39 | 1 |
4 | Albert | 09/10/1994 00:00:00 | 10000 | 25 | 0 |
Join 2 DataFrames
var df = employeesDf.Merge<int>(departmentsDf, "Id", "EmployeeId", joinAlgorithm: JoinAlgorithm.Inner);
df.PrettyPrint();
Id | Name | Birthday | Salary | Age | NormalizedSalary | EmployeeId | Department |
---|---|---|---|---|---|---|---|
3 | Chris | 02/01/1980 00:00:00 | 40000 | 39 | 1 | 3 | Development |
1 | John | 23/04/1990 00:00:00 | 20000 | 29 | 0.3333333333333333 | 1 | Development |
2 | Ahmed | 04/05/1982 00:00:00 | 30000 | 37 | 0.6666666666666666 | 2 | HR |
4 | Albert | 09/10/1994 00:00:00 | 10000 | 25 | 0 | 4 | Other |
I’ll have to admit that I got confused at first, thinking that Join
should do this while Merge
would only merge without a condition.
Drop a column and sort by a column
df.Columns.Remove("EmployeeId");
df = df.Sort("Id");
df.PrettyPrint();
Id | Name | Birthday | Salary | Age | NormalizedSalary | Department |
---|---|---|---|---|---|---|
1 | John | 23/04/1990 00:00:00 | 20000 | 29 | 0.3333333333333333 | Development |
2 | Ahmed | 04/05/1982 00:00:00 | 30000 | 37 | 0.6666666666666666 | HR |
3 | Chris | 02/01/1980 00:00:00 | 40000 | 39 | 1 | Development |
4 | Albert | 09/10/1994 00:00:00 | 10000 | 25 | 0 | Other |
Mean salary by department
var employeesByDepartment = df.GroupBy("Department");
employeesByDepartment.Mean("Salary").PrettyPrint();
Department | Salary |
---|---|
Development | 30000 |
HR | 30000 |
Other | 10000 |
Value counts of department
departmentColumn.ValueCounts().PrettyPrint();
Values | Counts |
---|---|
Development | 2 |
HR | 1 |
Other | 1 |
Sample rows
df.Sample(2).PrettyPrint();
Note: This method can give duplicate rows. I created an issue and hopefully it’ll be fixed next preview.
Conclusion
I’m excited to see what this library becomes in the future. It still needs a lot of work but I hope it gets mature enough to get out of the lab and be officially supported, it can be a nice addition to .NET.
If you’re interested in the code used in this post:
See you soon!